Disease Prediction

Yueyuan He

04/03/2020

Business Understanding

We use personal information in this dataset to predict whether or not a patient has a certain unspecified disease.

Dataset

Attribute information for the dataset (Disease_Prediction_Training.csv):

  • Age: in years
  • Gender: male/female
  • Height: in unit of cm
  • Weight: in unit of kg
  • Low Blood Pressure: lower bound (diastolic) of the blood pressure measurement
  • High Blood Pressure: higher bound (systolic) of the blood pressure measurement
  • Cholesterol: one of three cholesterol levels (normal/high/too high)
  • Glucose: one of three glucose levels (normal/high/too high)
  • Smoke: 1/0 regarding if the patient smokes
  • Alcohol: 1/0 regarding if the patient drinks alcohol
  • Exercise: 1/0 regarding if the patient exercises regularly
  • Disease: The binary target variable. Does the patient have the disease?

Data Exploration

Data Importing

Library Import

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px
import seaborn as sns

from sklearn.naive_bayes import GaussianNB, BernoulliNB, MultinomialNB
from sklearn.svm import SVC
from sklearn import svm
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn import model_selection

from sklearn.model_selection import train_test_split

from sklearn import metrics
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.metrics import confusion_matrix
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import classification_report
from sklearn.metrics import roc_curve, auc
from sklearn.model_selection import KFold
from sklearn.model_selection import ShuffleSplit
from sklearn.model_selection import cross_val_score

Data Import

In [2]:
data = pd.read_csv("Disease_Prediction_Training.csv")
In [3]:
data.head()
Out[3]:
Age Gender Height Weight High Blood Pressure Low Blood Pressure Cholesterol Glucose Smoke Alcohol Exercise Disease
0 59 female 167 88.0 130 68 normal normal 0 0 1 0
1 64 female 150 71.0 140 100 normal normal 0 0 0 1
2 41 female 166 83.0 100 70 normal normal 0 1 1 0
3 50 male 172 110.0 130 80 normal normal 1 0 1 0
4 39 female 162 61.0 110 80 high high 0 0 1 0

Exploratory Data Analysis

Data Size and Types

In [4]:
data.shape
Out[4]:
(49000, 12)

We can see this dataset has 49,000 records and 12 columns.

In [5]:
data.dtypes
Out[5]:
Age                      int64
Gender                  object
Height                   int64
Weight                 float64
High Blood Pressure      int64
Low Blood Pressure       int64
Cholesterol             object
Glucose                 object
Smoke                    int64
Alcohol                  int64
Exercise                 int64
Disease                  int64
dtype: object

Data Missing Values

In [6]:
data.isnull().sum()
Out[6]:
Age                    0
Gender                 0
Height                 0
Weight                 0
High Blood Pressure    0
Low Blood Pressure     0
Cholesterol            0
Glucose                0
Smoke                  0
Alcohol                0
Exercise               0
Disease                0
dtype: int64

After checking for missing values, we can see there are none in this dataset, but we still need to verify that every value is right and reasonable.
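A simple range check makes "right and reasonable" concrete. The sketch below counts out-of-range values per column; the bounds are illustrative assumptions for a sanity check, not clinical reference ranges.

```python
import pandas as pd

def range_report(df, bounds):
    """Count values outside illustrative [low, high] bounds per column."""
    report = {}
    for col, (low, high) in bounds.items():
        report[col] = int(((df[col] < low) | (df[col] > high)).sum())
    return report

# Illustrative sanity bounds (assumptions, not medical reference ranges)
bounds = {
    "Height": (50, 250),               # cm
    "Weight": (30, 300),               # kg
    "High Blood Pressure": (70, 250),  # mmHg
    "Low Blood Pressure": (40, 150),   # mmHg
}
```

Running `range_report(data, bounds)` points directly at the columns that need the cleaning explored below.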

Data Value Analysis

In [7]:
data.describe()
Out[7]:
Age Height Weight High Blood Pressure Low Blood Pressure Smoke Alcohol Exercise Disease
count 49000.000000 49000.000000 49000.000000 49000.000000 49000.000000 49000.000000 49000.000000 49000.000000 49000.000000
mean 52.853306 164.366878 74.190527 128.698939 96.917367 0.088265 0.054245 0.803204 0.499959
std 6.763065 8.216637 14.329934 147.624582 200.368069 0.283683 0.226503 0.397581 0.500005
min 29.000000 55.000000 10.000000 -150.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 48.000000 159.000000 65.000000 120.000000 80.000000 0.000000 0.000000 1.000000 0.000000
50% 53.000000 165.000000 72.000000 120.000000 80.000000 0.000000 0.000000 1.000000 0.000000
75% 58.000000 170.000000 82.000000 140.000000 90.000000 0.000000 0.000000 1.000000 1.000000
max 64.000000 207.000000 200.000000 14020.000000 11000.000000 1.000000 1.000000 1.000000 1.000000

From data.describe, we can see the mean age is about 53; the youngest person is 29 years old and the oldest is 64. The mean height is 164.4 cm, but the shortest person is listed at 55 cm. According to Wikipedia, the world's shortest recorded person was 54.6 cm, so a 55 cm height is not strictly impossible, but having several people this far below normal height is very unlikely.

In [8]:
fig = px.box(data, y="Height")
fig.show()
In [9]:
data[data.Height<60]
Out[9]:
Age Gender Height Weight High Blood Pressure Low Blood Pressure Cholesterol Glucose Smoke Alcohol Exercise Disease
6609 50 female 59 57.6 125 67 normal normal 0 0 0 0
10944 64 female 55 81.0 130 90 normal normal 0 0 1 1
45588 51 female 57 61.0 130 90 normal normal 0 0 1 1
In [10]:
fig = px.box(data, y="Weight")
fig.show()

As with the height issues, it is not possible for a person to weigh 10 kg. According to Wikipedia, the lightest living adult woman weighs about 26 kg. Therefore, I decided to drop records with weights below 30 kg.
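The drop decided on here can be implemented with a boolean mask; a minimal sketch, with the 30 kg cutoff taken from the discussion above (the helper name is mine):

```python
import pandas as pd

def drop_implausible_weights(df, min_kg=30):
    """Keep only rows whose Weight is at or above the illustrative cutoff."""
    return df[df["Weight"] >= min_kg].reset_index(drop=True)
```

Applied as `data = drop_implausible_weights(data)`, this would remove the six records listed in the next cell.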

In [11]:
data[data.Weight<30]
Out[11]:
Age Gender Height Weight High Blood Pressure Low Blood Pressure Cholesterol Glucose Smoke Alcohol Exercise Disease
7496 51 female 171 29.0 110 70 high normal 0 0 1 1
12534 51 male 165 10.0 180 1100 high high 0 0 1 1
16903 63 female 157 23.0 110 80 normal normal 0 0 1 0
19626 40 male 128 28.0 120 80 normal normal 0 0 1 0
41338 59 male 178 11.0 130 90 normal normal 0 0 1 1
47626 55 male 177 22.0 120 80 normal normal 1 1 1 0
In [12]:
fig = px.box(data, y="Low Blood Pressure")
fig.show()
In [13]:
data[data["Low Blood Pressure"]<200]
Out[13]:
Age Gender Height Weight High Blood Pressure Low Blood Pressure Cholesterol Glucose Smoke Alcohol Exercise Disease
0 59 female 167 88.0 130 68 normal normal 0 0 1 0
1 64 female 150 71.0 140 100 normal normal 0 0 0 1
2 41 female 166 83.0 100 70 normal normal 0 1 1 0
3 50 male 172 110.0 130 80 normal normal 1 0 1 0
4 39 female 162 61.0 110 80 high high 0 0 1 0
... ... ... ... ... ... ... ... ... ... ... ... ...
48995 56 female 163 70.0 100 80 normal normal 0 0 1 1
48996 48 female 154 60.0 120 80 normal normal 0 0 1 0
48997 40 female 180 79.0 120 80 normal normal 0 0 1 0
48998 60 female 155 70.0 150 100 too high normal 0 0 1 0
48999 53 female 169 98.0 120 70 normal normal 0 0 1 0

48341 rows × 12 columns

In [14]:
fig = px.box(data, y="High Blood Pressure")
fig.show()
In [15]:
data[data["High Blood Pressure"]<90]
Out[15]:
Age Gender Height Weight High Blood Pressure Low Blood Pressure Cholesterol Glucose Smoke Alcohol Exercise Disease
409 53 male 158 61.0 12 80 too high too high 0 0 1 0
441 40 male 188 96.0 80 130 normal normal 0 1 1 0
905 48 female 98 45.0 12 80 normal normal 0 0 1 0
1018 42 female 152 60.0 11 80 normal normal 0 0 1 0
1302 49 male 172 68.0 12 80 normal normal 0 0 0 1
... ... ... ... ... ... ... ... ... ... ... ... ...
47985 61 female 162 72.0 80 140 too high too high 0 0 1 1
48165 58 female 155 69.0 13 90 normal too high 0 0 1 1
48233 50 female 155 76.0 70 15 normal normal 0 1 1 1
48316 40 female 150 42.0 80 50 high high 0 0 0 0
48714 52 female 170 85.0 15 90 normal normal 0 0 1 1

233 rows × 12 columns

Data Outlier Detection

In [16]:
data["High Blood Pressure"].unique()
Out[16]:
array([  130,   140,   100,   110,   120,   150,    90,   160,   155,
         145,   180,  1409,   200,   169,   190,   170,   105,   125,
         126,   128,   153,   124,    12,    80,   141,   220,   115,
         134,   135,    99,   165,   172,    11,   191,   133, 13010,
          95,   132,   147,   210,   119,   144,    85,    14,   148,
         187,   163,   103,   143,   121,   117,   127,   175,   137,
         138,   139,   118,  1420,   146,   101,   113,   185,   149,
         108,   151,    93,   131,   168,   129, 11020,   123,   106,
        1400,   176,   156,    13,    15,   162,   122,   111,   230,
         116,   171,    97,   907,   240,   161,   112,   154,   166,
         109,     1,   152,   136,   159,   158,   142,  1130, 14020,
         178,   157,   164,    70,   174,   188,   104,   167,   114,
        1202,    20,   179,   960,   102,   196,  -140,    10,   906,
         107,  2000,  1500,   207,   701,  -120,   177,    17,  -100,
         181,  -150,  -115,  1110,  1300,   202,   215,   195,   401,
         902,   199,    16,   173,   909,    96,  1205,     7])
In [17]:
data[(data["High Blood Pressure"]>1000) &(data["High Blood Pressure"]<2000) ]
Out[17]:
Age Gender Height Weight High Blood Pressure Low Blood Pressure Cholesterol Glucose Smoke Alcohol Exercise Disease
40 46 male 164 66.0 1409 90 normal normal 0 0 1 1
3674 64 female 165 67.0 1420 80 high normal 0 0 1 1
5603 57 male 170 78.0 1400 90 high normal 0 0 1 0
12182 56 male 175 78.0 1130 90 normal normal 0 0 1 1
15720 57 female 160 60.0 1202 80 normal normal 0 0 1 1
23523 43 female 168 72.0 1500 80 normal normal 0 0 1 1
33156 63 female 153 63.0 1110 80 normal normal 0 0 0 1
35761 60 male 166 73.0 1300 90 normal normal 0 1 1 0
45156 46 female 168 69.0 1205 90 normal normal 0 0 0 1
In [18]:
data["Low Blood Pressure"].unique()
Out[18]:
array([   68,   100,    70,    80,    90,    60,    91,  1000,   120,
          61,    79,   104,    95,    86,    59,    69,   130,  1100,
          83,    85,   160,   140,  1008,    89,   110,  1177,    75,
         106,    99,    57,    65,    94,   105,    76,    77,    82,
          96,  1007,    78, 11000,   101,    84,   112,   107,   180,
          72,    40,    20,    74,    67,  5700,    50,    81,    62,
          88,  1200,   109,   710,   802,    73,    64,    92,    98,
          56,  9100,  1125,  1120,  1110,   103,     0,  9011,   820,
          71,   150,    97,    93,  8100,     6,   700,  1088,    87,
       10000,   108,  1003,  1002,    10,    66,   190,  1101,   809,
         111,     8,    30,   170,  8099,  8000,    63,  9800,   102,
         709,   900,  7100,   182,   850,  1001,  1033,   115,  1139,
         801,  1044,   119,    45,   800,   708,  8077,    58,   585,
        1111,  1011,    55,   870,   126,  1022,   125,   810,  8079,
        1900,   113,    52,   135,    49,     7,   121,  2088,    53,
         910,   880,   122,  1400,  8044,   602,  1077,    15])
In [19]:
def blood_pressure(x):
    # Repair likely data-entry errors in a blood pressure reading
    if x < 0:
        x = abs(x)          # sign flipped, e.g. -140 -> 140
    elif 0 < x < 30:
        x = x * 10          # missing trailing digit, e.g. 11 -> 110
    elif 300 < x <= 2000:
        x = int(x / 10)     # one extra digit, e.g. 1400 -> 140
    elif x > 2000:
        x = int(x / 100)    # two extra digits, e.g. 14020 -> 140

    return x
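A few spot checks make the repair rules concrete, using values seen in the unique() lists above (the function is repeated so the cell runs standalone):

```python
# Spot checks for the blood pressure repair rules, using values from
# the unique() lists above. blood_pressure is repeated here so the
# cell runs standalone.
def blood_pressure(x):
    if x < 0:
        x = abs(x)
    elif 0 < x < 30:
        x = x * 10
    elif 300 < x <= 2000:
        x = int(x / 10)
    elif x > 2000:
        x = int(x / 100)
    return x

print(blood_pressure(-140))    # sign flipped -> 140
print(blood_pressure(11))      # missing trailing digit -> 110
print(blood_pressure(1400))    # one extra digit -> 140
print(blood_pressure(14020))   # two extra digits -> 140
```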
In [20]:
data["Low Blood Pressure"] = data["Low Blood Pressure"].apply(blood_pressure)
In [21]:
data["Low Blood Pressure"].unique()
Out[21]:
array([ 68, 100,  70,  80,  90,  60,  91, 120,  61,  79, 104,  95,  86,
        59,  69, 130, 110,  83,  85, 160, 140,  89, 117,  75, 106,  99,
        57,  65,  94, 105,  76,  77,  82,  96,  78, 101,  84, 112, 107,
       180,  72,  40, 200,  74,  67,  50,  81,  62,  88, 109,  71,  73,
        64,  92,  98,  56, 111, 103,   0, 150,  97,  93, 108,  87,  66,
       190,  30, 170,  63, 102, 182, 115, 113, 119,  45,  58,  55, 126,
       125,  52, 135,  49, 121,  20,  53, 122])
In [22]:
data["High Blood Pressure"] = data["High Blood Pressure"].apply(blood_pressure)
In [23]:
data["High Blood Pressure"].unique()
Out[23]:
array([130, 140, 100, 110, 120, 150,  90, 160, 155, 145, 180, 200, 169,
       190, 170, 105, 125, 126, 128, 153, 124,  80, 141, 220, 115, 134,
       135,  99, 165, 172, 191, 133,  95, 132, 147, 210, 119, 144,  85,
       148, 187, 163, 103, 143, 121, 117, 127, 175, 137, 138, 139, 118,
       142, 146, 101, 113, 185, 149, 108, 151,  93, 131, 168, 129, 123,
       106, 176, 156, 162, 122, 111, 230, 116, 171,  97, 240, 161, 112,
       154, 166, 109,  10, 152, 136, 159, 158, 178, 157, 164,  70, 174,
       188, 104, 167, 114, 179,  96, 102, 196, 107, 207, 177, 181, 202,
       215, 195,  40, 199, 173])
In [24]:
data.shape
Out[24]:
(49000, 12)
In [25]:
data[data["High Blood Pressure"] < data["Low Blood Pressure"]]
Out[25]:
Age Gender Height Weight High Blood Pressure Low Blood Pressure Cholesterol Glucose Smoke Alcohol Exercise Disease
441 40 male 188 96.0 80 130 normal normal 0 1 1 0
671 48 male 173 78.0 90 140 normal normal 0 1 1 1
3359 52 female 170 68.0 100 200 normal normal 0 0 1 0
3378 55 male 170 82.0 90 140 high normal 0 1 1 1
7236 63 male 170 65.0 90 140 normal normal 0 0 1 1
... ... ... ... ... ... ... ... ... ... ... ... ...
46815 44 male 170 66.0 120 200 normal normal 0 0 1 1
47530 59 female 160 83.0 80 120 normal normal 0 0 1 0
47759 46 male 159 68.0 70 100 normal normal 0 0 0 0
47985 61 female 162 72.0 80 140 too high too high 0 0 1 1
48233 50 female 155 76.0 70 150 normal normal 0 1 1 1

94 rows × 12 columns

In [26]:
high = data['High Blood Pressure']
data.insert(6,'High Blood Pressure_copy',high)
In [27]:
mask = data['High Blood Pressure'] < data['Low Blood Pressure']
data.loc[mask, 'High Blood Pressure'] = data.loc[mask, 'Low Blood Pressure']
In [28]:
mask = data['Low Blood Pressure'] > data['High Blood Pressure_copy']
data.loc[mask, 'Low Blood Pressure'] = data.loc[mask, 'High Blood Pressure_copy']

In [29]:
data = data.drop('High Blood Pressure_copy',axis=1)
In [30]:
data.loc[data['High Blood Pressure'] < 120, 'High Blood Pressure'] = 120
data.loc[data['Low Blood Pressure'] > 90, 'Low Blood Pressure'] = 90

In [31]:
#data = data[(data["High Blood Pressure"] >= 120) & (data["Low Blood Pressure"] >0) & (data["Low Blood Pressure"] <=90)]
data = data[data['Low Blood Pressure'] > 0]
In [32]:
data
Out[32]:
Age Gender Height Weight High Blood Pressure Low Blood Pressure Cholesterol Glucose Smoke Alcohol Exercise Disease
0 59 female 167 88.0 130 68 normal normal 0 0 1 0
1 64 female 150 71.0 140 90 normal normal 0 0 0 1
2 41 female 166 83.0 120 70 normal normal 0 1 1 0
3 50 male 172 110.0 130 80 normal normal 1 0 1 0
4 39 female 162 61.0 120 80 high high 0 0 1 0
... ... ... ... ... ... ... ... ... ... ... ... ...
48995 56 female 163 70.0 120 80 normal normal 0 0 1 1
48996 48 female 154 60.0 120 80 normal normal 0 0 1 0
48997 40 female 180 79.0 120 80 normal normal 0 0 1 0
48998 60 female 155 70.0 150 90 too high normal 0 0 1 0
48999 53 female 169 98.0 120 70 normal normal 0 0 1 0

48988 rows × 12 columns

In [33]:
fig = px.histogram(data, x="Cholesterol")
fig.show()
In [34]:
fig = px.histogram(data, x="Glucose")
fig.show()

Data Visualization

In [35]:
fig = px.histogram(data, x="Gender")
fig.show()
In [36]:
fig = px.histogram(data, x="Height")
fig.show()
In [37]:
fig = px.histogram(data, x="Weight")
fig.show()
In [38]:
fig = px.histogram(data, x="High Blood Pressure")
fig.show()
In [39]:
fig = px.histogram(data, x="Low Blood Pressure")
fig.show()

Summary

This dataset doesn't have any missing values, but the blood pressure columns have many problems: some values lie far outside any reasonable range. I used pandas' apply function to repair them, then capped Low Blood Pressure at 90 and floored High Blood Pressure at 120, so that the readings stay within a reasonable, physiologically plausible range.
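The final capping step can also be written with pandas' Series.clip, which is vectorized and avoids chained-assignment pitfalls; a minimal sketch on illustrative readings:

```python
import pandas as pd

low = pd.Series([68, 100, 80, 120])    # illustrative diastolic readings
high = pd.Series([110, 140, 120, 90])  # illustrative systolic readings

low_capped = low.clip(upper=90)     # diastolic capped at 90
high_capped = high.clip(lower=120)  # systolic floored at 120
```

On the real data this would be `data["Low Blood Pressure"].clip(upper=90)` and `data["High Blood Pressure"].clip(lower=120)`.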

Data Preprocessing

Categorical to Numerical

In [40]:
data.Gender.unique()
Out[40]:
array(['female', 'male'], dtype=object)
In [41]:
data = data.join(pd.get_dummies(data.Gender,prefix='Gender'))
In [42]:
data.Cholesterol.unique()
Out[42]:
array(['normal', 'high', 'too high'], dtype=object)
In [43]:
data = data.join(pd.get_dummies(data.Cholesterol,prefix='Cholesterol'))
In [44]:
data.Glucose.unique()
Out[44]:
array(['normal', 'high', 'too high'], dtype=object)
In [45]:
data = data.join(pd.get_dummies(data.Glucose,prefix='Glucose'))
In [46]:
data
Out[46]:
Age Gender Height Weight High Blood Pressure Low Blood Pressure Cholesterol Glucose Smoke Alcohol Exercise Disease Gender_female Gender_male Cholesterol_high Cholesterol_normal Cholesterol_too high Glucose_high Glucose_normal Glucose_too high
0 59 female 167 88.0 130 68 normal normal 0 0 1 0 1 0 0 1 0 0 1 0
1 64 female 150 71.0 140 90 normal normal 0 0 0 1 1 0 0 1 0 0 1 0
2 41 female 166 83.0 120 70 normal normal 0 1 1 0 1 0 0 1 0 0 1 0
3 50 male 172 110.0 130 80 normal normal 1 0 1 0 0 1 0 1 0 0 1 0
4 39 female 162 61.0 120 80 high high 0 0 1 0 1 0 1 0 0 1 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
48995 56 female 163 70.0 120 80 normal normal 0 0 1 1 1 0 0 1 0 0 1 0
48996 48 female 154 60.0 120 80 normal normal 0 0 1 0 1 0 0 1 0 0 1 0
48997 40 female 180 79.0 120 80 normal normal 0 0 1 0 1 0 0 1 0 0 1 0
48998 60 female 155 70.0 150 90 too high normal 0 0 1 0 1 0 0 0 1 0 1 0
48999 53 female 169 98.0 120 70 normal normal 0 0 1 0 1 0 0 1 0 0 1 0

48988 rows × 20 columns

In [47]:
data = data.drop(['Gender','Cholesterol','Glucose'],axis=1)
data
Out[47]:
Age Height Weight High Blood Pressure Low Blood Pressure Smoke Alcohol Exercise Disease Gender_female Gender_male Cholesterol_high Cholesterol_normal Cholesterol_too high Glucose_high Glucose_normal Glucose_too high
0 59 167 88.0 130 68 0 0 1 0 1 0 0 1 0 0 1 0
1 64 150 71.0 140 90 0 0 0 1 1 0 0 1 0 0 1 0
2 41 166 83.0 120 70 0 1 1 0 1 0 0 1 0 0 1 0
3 50 172 110.0 130 80 1 0 1 0 0 1 0 1 0 0 1 0
4 39 162 61.0 120 80 0 0 1 0 1 0 1 0 0 1 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
48995 56 163 70.0 120 80 0 0 1 1 1 0 0 1 0 0 1 0
48996 48 154 60.0 120 80 0 0 1 0 1 0 0 1 0 0 1 0
48997 40 180 79.0 120 80 0 0 1 0 1 0 0 1 0 0 1 0
48998 60 155 70.0 150 90 0 0 1 0 1 0 0 0 1 0 1 0
48999 53 169 98.0 120 70 0 0 1 0 1 0 0 1 0 0 1 0

48988 rows × 17 columns

Split Data

In [48]:
X = data.drop(['Disease'],axis = 1)
y = data['Disease']
In [49]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)

Standardize Data

In [50]:
scaler = StandardScaler()
scaler.fit(X_train)
X_train_std = scaler.transform(X_train)
X_test_std = scaler.transform(X_test)

Summary

In this section, I converted the categorical features (gender, cholesterol, and glucose) into numerical dummy variables, then split the data into training and testing sets with a 0.67/0.33 ratio. Finally, to put all features on a comparable scale, I standardized the data for modeling.
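The preprocessing steps above can be sketched as a single helper; the column names follow this dataset, and (as in the cells above) the scaler is fit on the training split only:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

def preprocess(df):
    """One-hot encode the categorical columns, split, then standardize."""
    df = pd.get_dummies(df, columns=["Gender", "Cholesterol", "Glucose"])
    X = df.drop(["Disease"], axis=1)
    y = df["Disease"]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.33, random_state=42)
    scaler = StandardScaler().fit(X_train)  # fit on training data only
    return scaler.transform(X_train), scaler.transform(X_test), y_train, y_test
```

Fitting the scaler on the training split alone keeps information from the test set from leaking into the transformation.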

Model

In [51]:
def plot_roc_auc(labels, predict_prob):
    false_positive_rate,true_positive_rate,thresholds=roc_curve(labels, predict_prob)
    roc_auc=auc(false_positive_rate, true_positive_rate)
    plt.title('ROC')
    plt.plot(false_positive_rate, true_positive_rate,'b',label='AUC = %0.4f'% roc_auc)
    plt.legend(loc='lower right')
    plt.plot([0,1],[0,1],'r--')
    plt.ylabel('TPR')
    plt.xlabel('FPR')
    plt.show()

Naive Bayes Classifier

GaussianNB

In [52]:
gnb = GaussianNB()
gnb.fit(X_train_std, y_train)
Out[52]:
GaussianNB(priors=None, var_smoothing=1e-09)
In [53]:
gnb_pred = gnb.predict(X_test_std)
print(f"Accuracy: {round(metrics.accuracy_score(y_test, gnb_pred)*100, 2)}%")
Accuracy: 68.13%
In [54]:
plot_roc_auc(y_test, gnb_pred)

MultinomialNB

In [55]:
gnb = MultinomialNB()
gnb.fit(X_train, y_train)
gnb_pred = gnb.predict(X_test)
print(f"Accuracy: {round(metrics.accuracy_score(y_test, gnb_pred)*100, 2)}%")
Accuracy: 70.49%
In [56]:
plot_roc_auc(y_test, gnb_pred)

Tuning Model

In [142]:
best_gnb_roc_auc = 0
Max_accuracy = 0
for i in range(4000,5000,100):
    gnb = MultinomialNB(alpha=i)
    gnb.fit(X_train, y_train)
    gnb_pred = gnb.predict(X_test)
    false_positive_rate,true_positive_rate,thresholds=roc_curve(y_test, gnb_pred)
    roc_auc=auc(false_positive_rate, true_positive_rate)
    best_gnb_roc_auc = max(best_gnb_roc_auc,roc_auc)
    print(i)
    print(f"Accuracy: {round(metrics.accuracy_score(y_test, gnb_pred)*100, 2)}%")
    Max_accuracy = max(Max_accuracy,round(metrics.accuracy_score(y_test, gnb_pred)*100, 2))
print("Best accuracy for Tuning Model is:", Max_accuracy)
print("Best AUC for Tuning Model is:",best_gnb_roc_auc)
4000
Accuracy: 71.34%
4100
Accuracy: 71.34%
4200
Accuracy: 71.38%
4300
Accuracy: 71.39%
4400
Accuracy: 71.39%
4500
Accuracy: 71.38%
4600
Accuracy: 71.39%
4700
Accuracy: 71.37%
4800
Accuracy: 71.38%
4900
Accuracy: 71.36%
Best accuracy for Tuning Model is: 71.39
Best AUC for Tuning Model is: 0.7142528188548581

Summary

First, I tried Gaussian naive Bayes to predict the target, but the data visualization shows the features are not normally distributed. I therefore used multinomial naive Bayes, which is suitable for classification with discrete features. Of its three parameters, I tuned alpha, the smoothing parameter, over a range of values; judging by accuracy and AUC, the best model reaches 71.39% accuracy.
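How alpha works can be illustrated on a toy counts matrix: Lidstone/Laplace smoothing adds alpha to every feature count, so a large alpha pulls the class-conditional feature probabilities toward uniform. A minimal sketch (toy data, not the disease dataset):

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

X = np.array([[5, 0], [4, 1], [0, 5], [1, 4]])  # toy counts, two features
y = np.array([0, 0, 1, 1])

probs = {}
for alpha in (1, 1000):
    clf = MultinomialNB(alpha=alpha).fit(X, y)
    probs[alpha] = np.exp(clf.feature_log_prob_)  # P(feature | class)

# Heavy smoothing flattens the per-class feature probabilities toward 0.5
spread_small = probs[1][0].max() - probs[1][0].min()
spread_large = probs[1000][0].max() - probs[1000][0].min()
```

With alpha = 1 the class-0 probabilities stay close to the raw frequencies, while alpha = 1000 squeezes them toward uniform; very large alpha values like the 4000-5000 range above act as strong regularization.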

In [145]:
best_nb = MultinomialNB(alpha=4700)
best_nb.fit(X_train, y_train)
Out[145]:
MultinomialNB(alpha=4700, class_prior=None, fit_prior=True)

KNN

Without Standardized Data

In [59]:
classifier = KNeighborsClassifier()
classifier.fit(X_train, y_train)
Out[59]:
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=5, p=2,
                     weights='uniform')
In [60]:
y_pred = classifier.predict(X_test)
print(f"Accuracy: {round(metrics.accuracy_score(y_test, y_pred)*100, 2)}%")
Accuracy: 69.2%
In [61]:
plot_roc_auc(y_test, y_pred)

With Standardized Data

In [62]:
std_classifier = KNeighborsClassifier()
std_classifier.fit(X_train_std, y_train)
Out[62]:
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=5, p=2,
                     weights='uniform')
In [64]:
knn_pred = std_classifier.predict(X_test_std)
print(f"Accuracy: {round(metrics.accuracy_score(y_test, knn_pred)*100, 2)}%")
plot_roc_auc(y_test, knn_pred)

Tuning Model

In [65]:
#cv = ShuffleSplit(n_splits=100, test_size=0.3, random_state=16)
cv = KFold(n_splits=10, shuffle=True, random_state=10)
from sklearn.model_selection import GridSearchCV
knn_param_grid = {'n_neighbors': range(5, 50,5),
              'leaf_size': range(5, 50,5)}
knn_grid = GridSearchCV(KNeighborsClassifier(), knn_param_grid,cv=cv)
best_knn_pred = knn_grid.fit(X_train_std, y_train)
best_knn_pred = best_knn_pred.predict(X_test_std)
In [66]:
plot_roc_auc(y_test,best_knn_pred)
In [67]:
knn_grid.best_params_
Out[67]:
{'leaf_size': 30, 'n_neighbors': 30}

Summary

I trained KNN both with and without standardization, then used grid search to find the best parameters. The AUC of the best model is 0.7264.

In [137]:
best_knn = KNeighborsClassifier(leaf_size=5, n_neighbors=45)
best_knn.fit(X_train_std,y_train)
Out[137]:
KNeighborsClassifier(algorithm='auto', leaf_size=5, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=45, p=2,
                     weights='uniform')

SVM

Linear kernel

In [69]:
lin_clf = svm.LinearSVC(dual=False)
lin_clf
Out[69]:
LinearSVC(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, loss='squared_hinge', max_iter=1000,
          multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
          verbose=0)
In [70]:
lin_clf.fit(X_train_std,y_train)
lin_clf_pred = lin_clf.predict(X_test_std)
In [71]:
plot_roc_auc(y_test, lin_clf_pred)

Tuning linear model

In [72]:
bs = model_selection.ShuffleSplit(n_splits=15,test_size=0.3,random_state=0)
param_grid = {'C':[0.001,0.01,0.05,0.1,0.25,0.5],'penalty':['l2']}

gridbs = GridSearchCV(lin_clf,param_grid,cv=bs)
gridbs.fit(X_train_std,y_train)
Out[72]:
GridSearchCV(cv=ShuffleSplit(n_splits=15, random_state=0, test_size=0.3, train_size=None),
             error_score='raise-deprecating',
             estimator=LinearSVC(C=1.0, class_weight=None, dual=False,
                                 fit_intercept=True, intercept_scaling=1,
                                 loss='squared_hinge', max_iter=1000,
                                 multi_class='ovr', penalty='l2',
                                 random_state=None, tol=0.0001, verbose=0),
             iid='warn', n_jobs=None,
             param_grid={'C': [0.001, 0.01, 0.05, 0.1, 0.25, 0.5],
                         'penalty': ['l2']},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring=None, verbose=0)
In [74]:
acc = cross_val_score(gridbs, X_train_std, y_train, cv=bs).mean()*100
acc
Out[74]:
73.13022578788802
In [75]:
y_pred = gridbs.predict(X_test_std)
target_names = ["disease", "no_disease"]
print(classification_report(y_test, y_pred, target_names=target_names))
              precision    recall  f1-score   support

     disease       0.70      0.80      0.75      8042
  no_disease       0.77      0.66      0.71      8125

    accuracy                           0.73     16167
   macro avg       0.74      0.73      0.73     16167
weighted avg       0.74      0.73      0.73     16167

In [76]:
gridbs.best_params_
Out[76]:
{'C': 0.01, 'penalty': 'l2'}
In [77]:
plot_roc_auc(y_test,y_pred)

Non-linear kernel

In [78]:
rbf = SVC(C=0.1,kernel='rbf',gamma=0.1,cache_size=2000)
rbf.fit(X_train_std,y_train)
acc = cross_val_score(rbf, X_train_std, y_train, cv=bs).mean()*100
acc
Out[78]:
73.15053654243256
In [79]:
non_svm_pred = rbf.predict(X_test_std)
plot_roc_auc(y_test, non_svm_pred)

Tuning non-linear kernel

In [80]:
param_grid = {'C':[0.1,0.5,0.8,1], 'gamma':[0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1]}
gridrbf = GridSearchCV(SVC(), param_grid)
gridrbf.fit(X_train_std, y_train)
nl_svm = gridrbf.predict(X_test_std)
nl_svm
/Users/mark/anaconda3/lib/python3.7/site-packages/sklearn/model_selection/_split.py:1978: FutureWarning:

The default value of cv will change from 3 to 5 in version 0.22. Specify it explicitly to silence this warning.

Out[80]:
array([1, 1, 0, ..., 0, 1, 0])
In [81]:
plot_roc_auc(y_test, nl_svm)
In [82]:
gridrbf.best_params_
Out[82]:
{'C': 1, 'gamma': 0.2}

Summary

C is the penalty parameter: the higher C is, the less tolerant the model is of misclassified training points. Gamma controls how far the influence of a single training example reaches: low values mean "far" (a smoother boundary), high values mean "close" (a more local one). Using grid search, I found the best parameters for the linear SVM and the non-linear SVM.
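How C and gamma shape an RBF SVM can be inspected directly through the number of support vectors the fitted model keeps; a sketch on synthetic data (the exact counts, and the direction of the trend, depend on the data):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic data standing in for the standardized training set
X, y = make_classification(n_samples=300, n_features=5, random_state=0)
X = StandardScaler().fit_transform(X)

# Inspect how many support vectors each gamma setting keeps
n_sv = {}
for gamma in (0.1, 1.0, 10.0):
    clf = SVC(C=1.0, kernel="rbf", gamma=gamma).fit(X, y)
    n_sv[gamma] = int(clf.support_vectors_.shape[0])
```

Comparing `n_sv` across settings is a quick way to see whether a given C/gamma pair is producing a simple margin or memorizing individual points.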

In [154]:
best_lin_svm = svm.LinearSVC(dual=False,C = 0.01, penalty = 'l2')
best_lin_svm.fit(X_train_std, y_train)
Out[154]:
LinearSVC(C=0.01, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, loss='squared_hinge', max_iter=1000,
          multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
          verbose=0)
In [151]:
best_nonlin_svm = SVC(C=0.8,kernel='rbf',gamma=0.1,cache_size=2000)
best_nonlin_svm.fit(X_train_std, y_train)
Out[151]:
SVC(C=0.8, cache_size=2000, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma=0.1, kernel='rbf',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False)

Random forest

Base Model

In [86]:
from sklearn.ensemble import RandomForestClassifier
num_trees = 100
rf = RandomForestClassifier(n_estimators=num_trees, max_features=10, random_state=16,bootstrap=True)
results = cross_val_score(rf, X_train_std, y_train, cv=cv)
print(f"Accuracy: {round(results.mean()*100, 2)}%")
Accuracy: 70.65%
In [87]:
rf.fit(X_train_std, y_train)
Out[87]:
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=None, max_features=10, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=16, verbose=0,
                       warm_start=False)
In [88]:
rf_pred =rf.predict(X_test_std)
In [89]:
plot_roc_auc(y_test, rf_pred)

Tuning Model

In [90]:
from sklearn.model_selection import GridSearchCV
rf_param_grid = {'n_estimators': range(60, 160,20),
                 'max_depth':range(1, 10),
              'max_features': range(5, 10)}
rf_grid = GridSearchCV(RandomForestClassifier(), rf_param_grid)
best_rf_pred = rf_grid.fit(X_train_std, y_train)
/Users/mark/anaconda3/lib/python3.7/site-packages/sklearn/model_selection/_split.py:1978: FutureWarning:

The default value of cv will change from 3 to 5 in version 0.22. Specify it explicitly to silence this warning.

In [91]:
best_rf_pred = best_rf_pred.predict(X_test_std)
In [92]:
plot_roc_auc(y_test, best_rf_pred)
In [93]:
rf_grid.best_params_
Out[93]:
{'max_depth': 8, 'max_features': 7, 'n_estimators': 120}

Summary

n_estimators is the number of trees in the forest; max_depth is the maximum depth of each tree; max_features is the number of features to consider when looking for the best split. Grid search gives the best combination of these parameters.
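A forest fitted with these parameters also exposes per-feature importances, which can help interpret the model; a sketch on synthetic data using the tuned hyperparameters from above:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data; hyperparameters mirror the grid-search result above
X, y = make_classification(n_samples=500, n_features=16, random_state=16)
rf = RandomForestClassifier(n_estimators=120, max_depth=8, max_features=7,
                            random_state=16, bootstrap=True).fit(X, y)

# Per-feature importances: non-negative and summing to 1
importances = rf.feature_importances_
```

On the real data, pairing `importances` with `X.columns` would show which attributes (blood pressure, age, cholesterol, etc.) drive the predictions.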

In [152]:
best_rf = RandomForestClassifier(n_estimators=120, max_depth =  8,max_features=7, random_state=16,bootstrap=True)
best_rf.fit(X_train_std, y_train)
Out[152]:
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=8, max_features=7, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=120,
                       n_jobs=None, oob_score=False, random_state=16, verbose=0,
                       warm_start=False)

Gradient Boosting Classifier

Base Model

In [96]:
from sklearn.ensemble import GradientBoostingClassifier as gbm
model = gbm(n_estimators=num_trees, random_state=16)
results = cross_val_score(model, X_train_std, y_train, cv=cv)
print(f"Accuracy for GBM: {round(results.mean()*100, 2)}%")
Accuracy for GBM: 73.77%

Tuning Model

In [162]:
from sklearn.model_selection import GridSearchCV
param_grid = {'learning_rate': np.arange(0.02, 0.1, 0.01),
              'n_estimators': range(60, 160, 20),
              'max_features':range(1, 10),
              'max_depth': range(2, 5)}
clf = GridSearchCV(gbm(), param_grid)
clf.fit(X_train_std, y_train)
/Users/mark/anaconda3/lib/python3.7/site-packages/sklearn/model_selection/_split.py:1978: FutureWarning:

The default value of cv will change from 3 to 5 in version 0.22. Specify it explicitly to silence this warning.

Out[162]:
GridSearchCV(cv='warn', error_score='raise-deprecating',
             estimator=GradientBoostingClassifier(criterion='friedman_mse',
                                                  init=None, learning_rate=0.1,
                                                  loss='deviance', max_depth=3,
                                                  max_features=None,
                                                  max_leaf_nodes=None,
                                                  min_impurity_decrease=0.0,
                                                  min_impurity_split=None,
                                                  min_samples_leaf=1,
                                                  min_samples_split=2,
                                                  min_weight_fraction_leaf=0.0,
                                                  n_estimators=100,
                                                  n_it...
                                                  random_state=None,
                                                  subsample=1.0, tol=0.0001,
                                                  validation_fraction=0.1,
                                                  verbose=0, warm_start=False),
             iid='warn', n_jobs=None,
             param_grid={'learning_rate': array([0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09]),
                         'max_depth': range(2, 5), 'max_features': range(1, 10),
                         'n_estimators': range(60, 160, 20)},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring=None, verbose=0)
In [163]:
print(f"Accuracy for best GBM: {round(clf.best_score_*100, 2)}%")
Accuracy for best GBM: 73.87%
In [164]:
for key, val in clf.best_params_.items():
  print(f"Best hyperparameter is {key}: {val}")
Best hyperparameter is learning_rate: 0.06999999999999999
Best hyperparameter is max_depth: 4
Best hyperparameter is max_features: 4
Best hyperparameter is n_estimators: 120
In [165]:
gbm_pred = clf.predict(X_test_std)
In [166]:
plot_roc_auc(y_test, gbm_pred)

Summary

As the model comparisons above show, Gradient Boosting is the best model I built. I tuned learning_rate, max_depth, max_features, and n_estimators by grid search to optimize it: learning_rate shrinks the contribution of each tree, n_estimators is the number of boosting stages to perform, and max_depth is the maximum depth of the individual regression estimators.
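The interaction between learning_rate and n_estimators can be seen directly: because each stage's contribution is scaled by learning_rate, a smaller rate typically needs more stages to reach the same fit. The sketch below illustrates this on synthetic data; it is not the notebook's tuning cell, and the rates shown are arbitrary examples.

```python
# Illustrative sketch: effect of learning-rate shrinkage in gradient
# boosting, on synthetic data (not the notebook's actual split).
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, random_state=16)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=16)

for lr in (0.02, 0.06, 0.3):
    clf = GradientBoostingClassifier(learning_rate=lr, n_estimators=80,
                                     max_depth=4, random_state=16)
    clf.fit(X_tr, y_tr)
    print(f"learning_rate={lr}: test accuracy {clf.score(X_te, y_te):.3f}")
```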

In [167]:
best_gbm = gbm(n_estimators=80, learning_rate=0.06, max_depth=4, max_features=4, random_state=16)
best_gbm.fit(X_train_std, y_train)
Out[167]:
GradientBoostingClassifier(criterion='friedman_mse', init=None,
                           learning_rate=0.06, loss='deviance', max_depth=4,
                           max_features=4, max_leaf_nodes=None,
                           min_impurity_decrease=0.0, min_impurity_split=None,
                           min_samples_leaf=1, min_samples_split=2,
                           min_weight_fraction_leaf=0.0, n_estimators=80,
                           n_iter_no_change=None, presort='auto',
                           random_state=16, subsample=1.0, tol=0.0001,
                           validation_fraction=0.1, verbose=0,
                           warm_start=False)

Data Validation

In [168]:
testing_data = pd.read_csv('./Disease_Prediction_Testing.csv')
testing = pd.read_csv('./Disease_Prediction_Testing.csv')
In [169]:
testing_data
Out[169]:
ID Age Gender Height Weight High Blood Pressure Low Blood Pressure Cholesterol Glucose Smoke Alcohol Exercise
0 0 44 female 160 59.0 100 80 high normal 0 0 1
1 1 41 female 169 74.0 120 70 normal normal 0 0 1
2 2 63 male 168 84.0 120 80 normal high 0 0 1
3 3 55 female 158 108.0 160 100 normal normal 0 0 0
4 4 55 female 167 67.0 120 80 normal normal 0 0 1
... ... ... ... ... ... ... ... ... ... ... ... ...
20995 20995 56 male 174 80.0 130 90 normal normal 1 1 1
20996 20996 46 female 164 83.0 120 80 normal normal 0 0 1
20997 20997 54 female 153 92.0 120 90 normal normal 0 0 1
20998 20998 62 female 163 73.0 140 90 normal high 0 0 1
20999 20999 51 male 176 105.0 130 80 normal normal 1 0 1

21000 rows × 12 columns

In [170]:
testing_data = testing_data.join(pd.get_dummies(testing_data.Gender, prefix='Gender'))
testing_data = testing_data.join(pd.get_dummies(testing_data.Cholesterol, prefix='Cholesterol'))
testing_data = testing_data.join(pd.get_dummies(testing_data.Glucose, prefix='Glucose'))
testing_data = testing_data.drop(['ID', 'Gender', 'Cholesterol', 'Glucose'], axis=1)
scaler = StandardScaler()
scaler.fit(testing_data)  # note: strictly, the scaler fitted on the training data should be reused here
testing_data_std = scaler.transform(testing_data)
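To guarantee the test set is scaled with the training statistics, the scaler and model can be wrapped in a single Pipeline (make_pipeline is already imported above). The sketch below is a minimal illustration on synthetic data, not the notebook's actual cells; the hyperparameters mirror best_gbm but the data is made up.

```python
# Illustrative sketch: a Pipeline fits the scaler only on the training
# data and reuses its statistics at predict time, avoiding leakage.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, random_state=16)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=16)

pipe = make_pipeline(StandardScaler(),
                     GradientBoostingClassifier(n_estimators=80,
                                                learning_rate=0.06,
                                                max_depth=4,
                                                random_state=16))
pipe.fit(X_tr, y_tr)        # scaler statistics come from X_tr only
preds = pipe.predict(X_te)  # X_te is transformed with those statistics
```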
In [171]:
nb_pred = best_nb.predict(testing_data)
knn_pred = best_knn.predict(testing_data_std)
lin_pred = best_lin_svm.predict(testing_data_std)
non_pred = best_nonlin_svm.predict(testing_data_std)
rf_pred = best_rf.predict(testing_data_std)
gbm_pred = best_gbm.predict(testing_data_std)
result = pd.DataFrame({'ID': testing.ID, 'Naive_Bayes': nb_pred, 'KNN': knn_pred,
                       'Linear_SVM': lin_pred, 'Non_SVM': non_pred,
                       'Random Forest': rf_pred, 'Gradient Boosting': gbm_pred})
In [172]:
result.to_csv('homework_3_YueyuanHe_results.csv', index=False)  # the ID column already identifies rows

Prediction Summary

To summarize, I explored the dataset and preprocessed it for model building. From the six models above I obtained predictions and an AUC for each, and I used optimization methods such as grid search to improve their performance. Gradient Boosting is currently the best model for predicting whether a person has the disease, with a cross-validated accuracy of about 73.87%.

In [ ]: